0.1 What's It All About?

This is an exhaustive analytics report that aims to give clear insight into the video game industry, from its most primitive stage to its peak. The data set is from Kaggle.

This report answers several questions about the video game industry, some of which are stated below.

0.1.1 Introducing Our Source: The Data

The data is in CSV (comma-separated values) format, and its dimensions, as returned by dim(df), are 16598 rows by 11 columns. The names of all the columns and their meanings are stated below:
Attribute      Meaning
Rank           Rank of the video game
Name           Name of the video game
Platform       Platform for which it was developed
Year           Year of release
Genre          Type of game
Publisher      Publishing/developing company
NA_Sales       North America total sales
EU_Sales       Europe total sales
JP_Sales       Japan total sales
Other_Sales    Sales in all other countries
Global_Sales   Global total sales

There is missing data in the CSV, so we have to clean the data and also tidy it.

0.1.2 Data Wrangling

Data wrangling is the collective term for data cleaning and data tidying. In this process we do the following things:

  • Check data consistency and duplicates
  • Check for missing data
  • Check for outliers
  • Find a strong reason before removing outliers
  • Fill in the missing values
  • Replace corrupted data with proper data
  • Feature engineering: the process of making new features
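Two of the checks in this list, counting missing values and flagging outliers, can be sketched in base R as follows. This is only a minimal illustration: the toy data frame and its values are made up, and the 1.5 * IQR rule is one common convention for flagging outliers, not necessarily the one used in this report.

```r
# Toy data frame with made-up values, standing in for the real data set
toy <- data.frame(
  Name  = c("A", "B", "C", "D", "E", "F"),
  Year  = c(2001, NA, 2005, 2007, NA, 2010),
  Sales = c(0.10, 0.20, 0.15, 0.30, 0.25, 40)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Flag outliers with the 1.5 * IQR rule (an assumed convention)
quartiles   <- unname(quantile(toy$Sales, c(0.25, 0.75)))
spread      <- quartiles[2] - quartiles[1]
lower_fence <- quartiles[1] - 1.5 * spread
upper_fence <- quartiles[2] + 1.5 * spread
is_outlier  <- toy$Sales < lower_fence | toy$Sales > upper_fence

print(missing_per_column)
print(toy[is_outlier, ])
```

A flagged row is only a candidate for removal; as the list says, we still need a strong reason before dropping it.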

Let's get hands-on with this:

First, we convert the categorical character columns into factors, so that we can easily apply statistical modelling functions to them. Factors are also handy in plotting libraries like ggplot2.
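A minimal sketch of this conversion in base R is shown below. The toy data frame is made up, and leaving Name as character is an assumption based on the summary output, where Name still appears with class character.

```r
# Toy data frame with made-up values
toy <- data.frame(
  Name         = c("Game A", "Game B", "Game C"),
  Platform     = c("DS", "PS2", "DS"),
  Genre        = c("Action", "Sports", "Action"),
  Global_Sales = c(1.2, 0.4, 0.9),
  stringsAsFactors = FALSE
)

# Find the character columns and convert all of them except Name to factors
char_cols  <- names(toy)[sapply(toy, is.character)]
to_convert <- setdiff(char_cols, "Name")
toy[to_convert] <- lapply(toy[to_convert], factor)

str(toy)
```

After this step, summary() reports level counts for Platform and Genre instead of just a length and a class.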

Now the categorical data are interpreted correctly by R. When we look at the data, we see that 'N/A' is used to represent NA. If we do not change it, R will not recognize it as a missing value and we will get error-prone results.

##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16598       DS     :2163   2009   :1431  
##  1st Qu.: 4151   Class :character   PS2    :2161   2008   :1428  
##  Median : 8300   Mode  :character   PS3    :1329   2010   :1259  
##  Mean   : 8301                      Wii    :1325   2007   :1202  
##  3rd Qu.:12450                      X360   :1265   2011   :1139  
##  Max.   :16600                      PSP    :1213   (Other):9868  
##                                     (Other):7142   NA's   : 271  
##           Genre                             Publisher    
##  Action      :3316   Electronic Arts             : 1351  
##  Sports      :2346   Activision                  :  975  
##  Misc        :1739   Namco Bandai Games          :  932  
##  Role-Playing:1488   Ubisoft                     :  921  
##  Shooter     :1310   Konami Digital Entertainment:  832  
##  Adventure   :1286   (Other)                     :11529  
##  (Other)     :5113   NA's                        :   58  
##     NA_Sales          EU_Sales          JP_Sales         Other_Sales      
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000  
##  Median : 0.0800   Median : 0.0200   Median : 0.00000   Median : 0.01000  
##  Mean   : 0.2647   Mean   : 0.1467   Mean   : 0.07778   Mean   : 0.04806  
##  3rd Qu.: 0.2400   3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000  
##  Max.   :41.4900   Max.   :29.0200   Max.   :10.22000   Max.   :10.57000  
##                                                                           
##   Global_Sales    
##  Min.   : 0.0100  
##  1st Qu.: 0.0600  
##  Median : 0.1700  
##  Mean   : 0.5374  
##  3rd Qu.: 0.4700  
##  Max.   :82.7400  
## 
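The 'N/A' replacement that produces the NA's counts in the summary above can be sketched as follows. The toy data frame is made up, and this base R approach is an assumption; the report may have done the replacement differently.

```r
# Toy data frame with made-up values; "N/A" is a plain string here
toy <- data.frame(
  Name      = c("Game A", "Game B", "Game C"),
  Year      = c("2008", "N/A", "2010"),
  Publisher = c("N/A", "Ubisoft", "Activision"),
  stringsAsFactors = FALSE
)

# Replace the literal string "N/A" with a real NA, column by column
toy[] <- lapply(toy, function(column) {
  column[column == "N/A"] <- NA
  column
})

# Build the factors after the replacement so "N/A" does not remain as a level
toy$Year      <- factor(toy$Year)
toy$Publisher <- factor(toy$Publisher)

summary(toy)
```

An alternative is to mark the string as missing at load time with the na.strings argument of read.csv, for example na.strings = c("NA", "N/A").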

Now we will check the consistency of the data: whether the data inside a column is homogeneous, and whether the values inside a column are feasible.

Taking the mean of the differences between the actual sales, calculated by summing the sales from all regions, and the Global_Sales attribute, we get

## [1] 0.0002765393

So from here we can see that the Global_Sales attribute is not exact and contains some error. Since the values are in millions of dollars, a mean difference of about 0.00028 per row adds up to roughly 4.6 million over all 16598 rows, so a significant amount has been entered incorrectly in the data. Let's replace the value of Global_Sales with the sum of JP_Sales, NA_Sales, EU_Sales and Other_Sales.
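The consistency check and the correction described above can be sketched like this. The numbers are made up, and the sign convention (regional sum minus Global_Sales) is an assumption, since the report does not say which way the difference is taken.

```r
# Toy data frame with made-up sales figures (in millions)
toy <- data.frame(
  NA_Sales     = c(1.00, 0.08, 0.24),
  EU_Sales     = c(0.50, 0.02, 0.11),
  JP_Sales     = c(0.25, 0.00, 0.04),
  Other_Sales  = c(0.10, 0.01, 0.04),
  Global_Sales = c(1.86, 0.12, 0.43)
)

# Actual sales: the row-wise sum of the four regional columns
regional     <- c("NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales")
actual_sales <- unname(rowSums(toy[regional]))

# Mean difference between the recomputed total and the recorded total
mean_difference <- mean(actual_sales - toy$Global_Sales)
print(mean_difference)

# Overwrite Global_Sales with the recomputed total
toy$Global_Sales <- actual_sales
```

Recomputing the column guarantees that the global figure always agrees with the regional figures used in the rest of the analysis.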

The long tail in the graph clearly shows that only very few games have total revenue greater than 75. Most probably these are the most popular games; if not, they may be outliers. We also have to check the data for duplicates.

## # A tibble: 2,775 x 2
##    Name                         count
##    <chr>                        <int>
##  1 Need for Speed: Most Wanted     12
##  2 FIFA 14                          9
##  3 LEGO Marvel Super Heroes         9
##  4 Madden NFL 07                    9
##  5 Ratatouille                      9
##  6 Angry Birds Star Wars            8
##  7 Cars                             8
##  8 FIFA 15                          8
##  9 FIFA Soccer 13                   8
## 10 Lego Batman 3: Beyond Gotham     8
## # ... with 2,765 more rows

So here we can see that there are 2,775 video games that have been published more than once. These games most likely earned strong revenue, which would explain their multiple releases.
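A count of repeated titles like the one above can be sketched in base R as follows. The toy data frame is made up, the filter to counts greater than one is an assumption about how the table was built, and the original output is a tibble, so the report itself probably used a different approach.

```r
# Toy data frame with made-up titles and platforms
toy <- data.frame(
  Name     = c("Game A", "Game A", "Game B", "Game C", "Game C", "Game C"),
  Platform = c("PS3", "X360", "DS", "PC", "PS3", "Wii"),
  stringsAsFactors = FALSE
)

# Count the rows per title
name_counts <- as.data.frame(table(Name = toy$Name),
                             responseName = "count",
                             stringsAsFactors = FALSE)

# Keep the titles that occur more than once, most frequent first
repeated <- name_counts[name_counts$count > 1, ]
repeated <- repeated[order(-repeated$count), ]
print(repeated)

# Rows that repeat both title and platform would be true duplicates
exact_duplicates <- sum(duplicated(toy[c("Name", "Platform")]))
print(exact_duplicates)
```

Separating repeated titles from exact duplicate rows matters: the first may simply be releases of one game on several platforms, while the second points to a data entry problem.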

In the next section we will analyse the trends, try to find correlations, and answer various curious questions too.